Annotated Suffix Tree Method for German Compound Splitting

Authors

  • Anna Shishkova
  • Ekaterina Chernyak
Abstract

The paper presents an unsupervised and knowledge-free approach to compound splitting. Although the research is focused on German compounds, the method is expected to be extensible to other compounding languages. The approach is based on the annotated suffix tree (AST) method proposed and modified by Mirkin et al. To the best of our knowledge, annotated suffix trees have not yet been used for compound splitting. The main idea of the approach is to match all the substrings of a word (suffixes and prefixes separately) against an AST, determining the longest and sufficiently frequent substring to perform a candidate split. A simplification considers only the suffixes (or prefixes) and splits a word at the beginning of the selected suffix (the longest and sufficiently frequent one). The results are evaluated by precision and recall.
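The suffix-only simplification described above can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a naive, uncompressed character trie in place of a compact annotated suffix tree, raw occurrence counts as the annotation, and two hypothetical parameters (`min_freq`, `min_len`) to stand in for "sufficiently frequent". The toy corpus and all function names are assumptions made for illustration only.

```python
class ASTNode:
    """One node of a naive annotated suffix tree (one character per edge)."""

    def __init__(self):
        self.children = {}
        self.freq = 0  # how many inserted suffixes pass through this node


def build_ast(words):
    """Insert every suffix of every word, counting visits along each path."""
    root = ASTNode()
    for word in words:
        w = word.lower()
        for start in range(len(w)):
            node = root
            for ch in w[start:]:
                node = node.children.setdefault(ch, ASTNode())
                node.freq += 1
    return root


def substring_freq(root, s):
    """Return the number of occurrences of s as a substring of the corpus words."""
    node = root
    for ch in s.lower():
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.freq


def split_by_suffix(root, word, min_freq=2, min_len=3):
    """Split at the start of the longest suffix seen at least min_freq times.

    min_freq and min_len are hypothetical thresholds, not values from the paper.
    Both resulting parts must have at least min_len characters.
    """
    w = word.lower()
    for start in range(min_len, len(w) - min_len + 1):
        suffix = w[start:]
        if substring_freq(root, suffix) >= min_freq:
            return (w[:start], suffix)
    return (w,)  # no sufficiently frequent suffix: leave the word unsplit


# Toy corpus (an assumption; a real run would use a large word list).
corpus = ["Baum", "Baumhaus", "Kuchen", "Apfelkuchen", "Haus"]
root = build_ast(corpus)
print(split_by_suffix(root, "Apfelbaum"))  # ('apfel', 'baum')
```

With such a small corpus the thresholds matter a great deal: short, frequent character sequences that are not words can also pass the frequency test, which is one reason the results are evaluated by precision and recall.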

Similar resources

Compact Suffix Trees Resemble PATRICIA Tries: Limiting Distribution of the Depth

Suffix trees are the most frequently used data structures in algorithms on words. In this paper, we consider the depth of a compact suffix tree, also known as the PAT tree, under some simple probabilistic assumptions. For a biased memoryless source, we prove that the limiting distribution for the depth in a PAT tree is the same as the limiting distribution for the depth in a PATRICIA trie, even...

Graph Retrieval with the Suffix Tree Model

The paper in hand presents an adoption of the suffix tree model for the retrieval of labeled graphs. The suffix tree model encodes path information of graphs in an efficient way and so reduces the size of the data structures compared to path index based approaches, while offering a better runtime performance than subgraph isomorphism based methods. Within a specific use case we evaluate the cor...

Simple Compound Splitting for German

INTRODUCTION • Compound: concatenation of two or more words Apfel|baum (apple tree) Apfel|kuchen|rezept|sammlung (apple cake recipe collection) • Productive word formation process → infinite amount of possible compounds • Compound splitting useful for many NLP applications – Statistical Machine translation: translation of new compounds, better lexical coverage – Information retrieval: better gen...

Annotated suffix trees for text modelling and classification

Suffix trees are compact and versatile data structures in which paths from the root to nodes represent substrings of the encoded text. By annotating such a tree with the frequencies of substrings, it is possible to construct a compact model of text that captures its sequential nature. This thesis investigates the use of such a model in the representation and classification of text. The basic ap...

A Method for Fast Approximate Searching of Polypeptide Structures in the PDB

The main contribution of this paper is a novel approach for fast searching in huge structural databases like the PDB. The data structure is based on an adaptation of the generalized suffix tree and relies on a translation- and rotation-invariant representation of the protein backbone. The method was evaluated by applying structural queries to the PDB and comparing the results to the established to...

Publication date: 2016